Is the Jeffreys' scale a reliable tool for Bayesian model comparison in cosmology? 
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We are entering an era where progress in cosmology is driven by data, and alternative models will 
have to be compared and ruled out according to some consistent criterium. The most conservative 
and widely used approach is Bayesian model comparison. In this paper we explicitly calculate the 
Bayes factors for all models that are linear with respect to their parameters. We do this in order 
to test the so called Jeffreys' scale and determine analytically how accurate its predictions are in a 
simple case where we fully understand and can calculate everything analytically. We also discuss 
the case of nested models, e.g. one with Afi and another with A'h 3 Mi parameters and we derive 
analytic expressions for both the Bayes factor and the Figure of Merit, defined as the inverse area 
of the model parameter's confidence contours. With all this machinery and the use of an explicit 
example we demonstrate that the threshold nature of Jeffreys' scale is not a "one size fits all" reliable 
tool for model comparison and that it may lead to biased conclusions. Furthermore, we discuss the 
importance of choosing the right basis in the context of models that are linear with respect to their 
parameters and how that basis affects the parameter estimation and the derived constraints. 



I. INTRODUCTION 



Model comparison is at the forefront of modern science, especially in an age of huge datasets and several competing 
theories. Cosmology has entered an era where large amounts of data will be flowing in from CMB and LSS experiments 
like Planck p], BOSS [7, DES [T, COrE Euclid [5], etc. This clearly raises a fundamental question: Given some 
cosmological observations in the form of data and some cosmological models that may depend on one or more variables, 
how does one choose the best model? The reason for asking this is quite obvious. Perhaps the models correspond 
to predictions of different and competing fundamental theories that may explain a range of phenomena. One such 
example is the plethora of different Dark Energy and Modified Gravity models (see e.g. Ref. [6 for details) that fit 
the current cosmological observations more or less equally well with General Relativity, at least within the range of 
a few sigmas. 

A common way to answer this question has been by using Bayesian statistics, see Refs. [THIO]. For instance, the 
usual method of comparing minimum P^r effective degree of freedom normally misses the point and is not very 
decisive. Other methods to decide which model gives the best description, given the data, include various Information 
Criteria, e.g. Akaike [TT] and Bayesian [T2], which are more or less justified, see [T^, and normally do not compare 
well among each other. 

On the other hand, the Bayesian evidence is based on Bayes theorem, see Refs. [II], [H] for in-dcpth reviews, which 
relates the posterior distribution V{u, A4\T)) for the parameters u of the model Ai given the data D, in terms of the 
likelihood distribution function >C(D|it, A^) within a given set of priors n{u,Ai) 

where the likelihood can be obtained from £(D|it,A^) ~ exp(— x^(u)/2). Here E is the Bayesian evidence, i.e. the 
average likelihood over the priors, 

E{-D\M)^ J duC{B\u,M)n{u,M), (1.2) 

or roughly, the probability of the data being true given the model, integrated over the whole parameter range u as 
defined by the priors. The comparison of the models proceeds as the ratio of this quantity evaluated for the different 
models 
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This expression may naively be considered to provide a mathematical representation of Occam's razor, because more 
complex models tend to be less predictive, lowering their average likelihood (within the priors) in comparison with 
simpler, more predictive models. Complex models can only be favored if they are able to provide a significantly 



improved fit to the data. The Bayes factor (1.3) is then used to give evidence of (i.e. favor) the model Aii against 
the model A4j using the so-called Jeffreys' scale, a particular interpretation of the Bayes factor which strengthens its 
verdict roughly each time the logarithm InBij increases by one unit, from (undecisive) to greater than 5 (strongly 
ruled out). More details on the Jeffreys' scale can be found in later sections and specific threshold values in Table [lT| 
In Secti on im we explicitly calculate the Bayes factors for all models that are linear with respect to their parameters. 
In Section |III| we discuss the case of nested models, e.g. one with Mi and another with M2 parameters and we derive 
analytic expressions for both the Bayes factor while in Section |IV| we discuss the same problem for the Figure of 
Merit. With all this machinery and the use of the explicit example we demonstrate in Section [nT] that the Jeffreys' 



scale is not a "one size fits all" reliable tool for model comparison, contrary to the common belief by many people. 

II. GENERAL LINEAR LEAST-SQUARES FITTING 
A. Minimization 

In this section we will briefly discuss the case of the general linear least squares fitting. Given some data that consist 
of N measurements {xi, yi, cTi), where i = (1, 2, N), and a model which is a linear combination of M functions, 

Af-l 

y(a;)= ^ a,X,(x), (2.1) 

i=0 

then the fitness of the model with respect to the data and the parameters a^, for i = (0, 1, M — 1), can be found 
by calculating the x^(a) defined as [T5], [IT] 



- y[x^;aj} 



i—1 ^ 

^ /2A^E^loJ^i^i(^y ^2 2) 



The base Xi{x) can be any set of M functions, e.g. monomials like {x^jflp ^ Chebyshev polynomials {Ti{x)}ffj^^ , 
of order M. The latter are a set of orthogonal polynomials that can act as a base of functions with the property that 
when X £ [^1,1] they have the smallest maximum deviation from the true function at any given order M. The first 
few Chebyshev polynomials are To{x) = 1, Ti{x) = x, T2{x) = -1 + 2x'^, T3{x) = -3a; + Ax^. When x £ [-1, 1], 
the variable x can be written as a; = cos{9) and the polynomials can also be expressed as T„(cos(6')) = cos(n6') = 
cos(n arccos(a;)), which implies that |T„(x)| < 1. Since in general our data will not be in the range [—1, 1], we can 

normalize x by using x = — 1 and using instead the basis Tn(x) = T„(^^ 1), where Xmax is the maximum 

value of the N data Xi. From now on we will assume that x has been normalized and we will drop the tilde on x. 
Finally, we will mostly follow the notation of Ref . [T^ . 

The best-fit parameters of the model can be found by minimizing the x^C^) oi Eq. (2.2) with respect to the 
parameters aj. This is done by demanding that the derivatives of x^(a) are equal to zero at the minimum, i.e. 
djX^iak) = 0. Then, this gives [16], [17] 

^ ^ dak Xk{Xi)\ I Vi - J2m=0 "-rnXmiXi))] 



1=1 \ k=Q 



(-2)E 



Xj{Xi) I yi X/m=0 ^mXm{' 



_i 

1=1 



Xj 



= 



v,i=l / \ fc=0 'i=l 
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If we define the Fisher Matrix Fij and the constant vector /3j as 



F,, = %x'Un = E ^'^^"^f^'^^^^ = const. (2.4) 
fc=i '^'^ 

= ^ 2^^^ ^ ^o^g^^^ (2.5) 



where as usual j = (0, 1, M — 1), then Equation (2.3) can be rewritten in matrix form and easily solved for the 
best-fit parameters Smin as 

afc,min = ^kj^l^j = ^kjPj, (2.6) 

where Ckj = F^_^ is the covariance matrix. If we define the parameter Sy = X^i^i vf /'^f : then the at the minimum 
can be written as 

Finally, the la errors on the best-fit parameters are given by the diagonal elements of the covariance matrix 

a{akf = Ckk (2.8) 
At this point we should note that by doing a Taylor expansion around the minimum the (o) can also be written as 

X^{a) Xmin + (a - amin)i Fij (a - flniiiOj, (2.9) 

since the first derivatives at the minimum are by definition equal to zero and all higher derivatives 9" i„X^(a) for 
n > 3 vanish like in our model. 

B. Change of basis 



At this point we should note that the functional form of the results of Eqs. (2.6)-(2.9) is completely independent 
of the basis used, regardless of it being some combination of polynomials (monomials or Chebyshev polynomials) or 
something more complicated, e.g. sin(n x) etc. For example, in the case of the monomials the Fisher matrix is equal 
to 



fc=i 



k 



for i — (0, 1, M — 1). If at this point we define a constant 

N n 

(2.11) 



where for example 5*0 = Tli^^i = X^fcLi ^,82 — X^aLi ^ ^^'^ then in the case that M = 3 the Fisher 

matrix will be given by 

(5o Si S2 
Si S2 S3 \, (2.12) 
^2 S3 S4 

where the constants Sn will only depend on the data. 

For the Chebyshev polynomials the Fisher matrix is equal to 

1 CI 2| ■^-^ T.i(xk)Tj(xk) 

Fij — ^dijX Imin — / ^ 2 ' (2-13) 

^ fc=l '^k 
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For example, in the case that M = 3 the Fisher matrix will be given by 



F = 



2^fc=o 







To{xk)Ti{xk) 


l^k=0 


To(xk)T2{xk) 








To{xk)Ti{xk) 


l^k-- 


Ti(xkf 




Tiix^)T2{xk) 


To(xk)T2ixk) 
-I 


l^k=0 


Ti{xk)T2(xk) 
-I 


l^k = 


T2(xk? 

=0 -I 



(2.14) 



which obviously is constant and will only depend on the data at hand^. Similar expressions can be derived for other 
cases as well. 

In general, if we change basis from Xi{x) to some new polynomial basis Xi{x) of the same order, then we will have 
to replace the parameters au with some new parameters Ok , assuming that we still have the same number of linearly 
dependent parameters M . However, the function y{x) will still be the same, so 



y{x) = ^ a,Xi{x) 

4=0 

A/-1 
i=0 

Let one of the two bases satisfy an orthogonality relation with a weight w{x) 

dx w{x) Xi{x)Xj{x) ~ CiSij 



(2.15) 



(2.16) 



for some constants c^. Then, by using Eq. ( |2.15 ) and the orthogonality relation we can derive a transformation A 
between the two sets of parameters and for the other quantities of interest: 



% = ^Ik^kiMj 



l3^ 



Cij = AikCki^ij 

Finally, as it can easily be seen from Eq. (2.7) the Xmin expected is invariant under the transformation A. 



(2.17) 
(2.18) 
(2.19) 
(2.20) 



For example, we will now consider the change of basis from the monomials to the Chebyshev polynomials. In this 
case we will have 



M-l 



M-1 



2x 



1 



n=0 n=0 

Remembering the fact that the Chebyshev polynomials satisfy the orthogonality relation 

r,(z)r„(.) _ 

r- s- — '^n'-'nk 

vi — 



(2.21) 



(2.22) 



where = it and fc„ ~ 7r/2 for n>l. Then we can multiply Eq. (2.21) with the appropriate factors where 



z = 1 and by integrating both sides over z G [—1,1] we get the transformation between the two sets of 

parameters as 



(2.23) 



^ At this point we should remind the reader of our shorthand convention that x is normalized, so for example by writing To(xi,) we 
actually imply To (-^^* 1). 
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TABLE I: The determinant of the transformation matrix Akn for various combinations of polynomials including the monomials 
x", the Legendre polynomials P„{x) and the Chebyshev polynomials Tn{x). These values are particularly important in the 
estimation of the Figure of Merit, as shown in a later section. 



|A| 



Monomials Legendre Chebyshev 



Monomials 


1 


12 


^™ 
16 


Legendre 


12 


1 


3 
4 


Chebyshev 


16 


4 
3 


1 



where the constant matrix A is given by 

Afc„ ^K' J ^ dz ^ ^ (2.24) 

In Table |l] we show the determinant of the transformation matrix Afc„ for various combinations of polynomials 
including the monomials x", the Legendre polynomials Pn{x) and the Chebyshev polynomials T„(a;). These values 
are particularly important in the estimation of the Figure of Merit, as shown in a later section. Finally, we should 
stress that these results are only valid for a transformation from the original polynomial basis X{x) with a set of 
M linearly dependent parameters Oi to a new set with the same number of linearly dependent parameters M and a 
polynomial basis X{x). 



C. Likelihood calculations 



After having determined the best-fit parameters in the previous section, we will now define the likelihood for our 
model. This is defined as [TO] : 

£ = AAcxp(-x^(a)/2) , (2.25) 

where the parameter M can be found by normalizing the likelihood, ie integrating it over all parameters. In our case 
this means : 

/ Cda= [ 7Vexp(-x^(a)/2)rfa = 

J J —00 

/OO 
g-l/2(a-a„.„). F., i'^-'^^^'.^). dagdai . ..doM -i = 1 (2.26) 
-co 

To proceed we now have to rotate the parameters to a basis where they are not correlated with each other. To do 
so we define a new variable Si = Dij (oj — Oj^min), where Dij can be found by decomposing the inverse covariance 
matrix F = C^^ = D'^D by using Cholesky decomposition^. Then, we have that dsi...dsi\[ = \D\ dfi...dfpf and the 
integration can proceed as usual and the normalization can be found. Going to the new basis we have 

Si = Dij {oj - aj,min) (2.27) 
dsi...dspf — \D\daQdai...daM-i (2.28) 
\D\ = = |Cr^/^ (2.29) 

where is to be understood as the value of the jth parameter aj at its best-fit value (the "minimum"). Then, 



^ Cholesky decomposition can easily be implemented in computer programs such as Mathematica. For example, in the latter the Cholesky 
decomposition of a matrix Af = D is given by D = CholeskyDecomposition[h[]. This works both symbolically and numerically. 
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Eq. (2.261 becomes 

/ + 00 -L 

M-l „+oo 



i=t) 



e-'^™/^ iFfi/' (27r)^'^/2 ^ 



(2.30) 



and finally, 



AA = eX-n/2|^|i/2(2^)-M/2^ ^2.31) 
Unsurprisingly, the resulting normalized likelihood now becomes 

^ = (y]v772exp(-(x'(«)-^mi„)/2), (2.32) 

lj2 *^^P " Xmin)/2) , 



(2^)AV2 Id 

where in the last line we used the fact that |F| = jC*!"^. 

III. BAYESIAN MODEL COMPARISON 

A. The Jeffreys' scale revisited 

In this section we will present results related to the Bayes factor By-, where (see [IB] and references there-in), in 
the context of our simple model. In this case, the Bayes factor Bij can be written as 



where L{Mi) denotes the probability p{D\Mi), called likelihood for the model Mi, to obtain the data D if the model 
Mi is the true one. Generally, L{Mi) is defined as: 

L(M,) = p{D\Mi) ^ j da- p[a\Mi)C,{a) (3.2) 

for models with one free parameter and where p{a\Mi) is the prior probability for the parameter a. Also, Ci{a) is the 
likelihood for the parameter a in the model and 

A(a) = e-^^^"^'^ (3.3) 

In the case that a has flat prior probabilities, that is we have no prior information on a besides that it lies in some 
range [a, a + Aa] then p{a\Mi) = ^ and 

L{M,) = dae-^"^'^^/^ (3.4) 

The reason why we did not use the normalized likelihood is that the evidence L{Mi) has to be dimensionless in 



general. As it can be seen by inspecting Eq. (2.25), this is achieved since the units of the prior Aa will cancel out 
with the ones from the infinitesimal quantity da in the integral. However, this would not happen had we included the 
normalization constant Af. 

Of course, all this can be easily generalized for models having more than one parameter as follows 

^M-l \ „a+Aa 



e-x(^)/2.da (3.5) 
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TABLE II: The values of both the linear and the logarithmic Jeffreys' scale, and the AIC and BIC criteria. For references on 
these values check the text. 



Bij 


In By 


A(AIC) 


A(BIC) 


Evidence 


< 3 


< 1.1 


< 2 


< 2 


Weak 


< 20 


< 3 


< 6 


< 6 


Definite 


< 150 


< 5 


< 10 


< 10 


Strong 


> 150 


> 5 


> 10 


> 10 


Very Strong 



where M is the total number of parameters and the integration over da = YljLo^ daj = daedal. ..daM-i is assumed 
to be multidimensional in general. Also, we will consider Gaussian priors of the form: 

Pr{d) = e-^'^-'^-)' ia-a,.h/2, (3 

(2tt) 

where the M priors are centered around the values Opi-.i and Hij is their inverse covariance. Also, we have properly 
normalized the gaussian priors to unity, such that Pr{d)dd — 1. 

The interpretation of the Bayes factor Bij is that ^2] when 1 < Bij < 3 there is evidence against Mj when compared 
with Mi, but it is only worth a bare mention. When 3 < Bij < 20 the evidence against Mj is definite but not strong. 
For 20 < Bij < 150 the evidence is strong and for Bij > 150 it is very strong. For handy reference we include the 
values of both the linear and the logarithmic Jeffreys' scale in Table |llj Jeffreys in his seminal paper [ZKinj provides 
somewhat different but in general consistent values. Several examples of the use of the Jeffreys' scale in cosmology 
and astronomy can by found in [18l [20H22] and references there-in. 



B. The Bayesian evidence 

1. Gaussian priors 



Using the machinery of the previous sections we will now calculate the Bayesian evidence of Eq. (3.5) in the case 
of the Gaussian priors. Using Eqs. (3.2) and (3.6) we have: 



Bi = 



-\-oo 



e -x'(^)/2Fr(a) • da 



+ 00 



-(a-a„i„)i Fij (a-a„i„)j/2-(a-apr)i Hij (a-ap,)j/2 ^ 



At this point we can introduce a new matrix G^- and constants c and Ci ^ such that: 

Gij ^ij ~^ 

0-1,1 — {o,k,minFkj + O-k^pT^kj) G^j^ 

C = (amin — 0,i)i Fij (amin — ai)j + (flpi- — ai)i Hij (flpr — ai)j 

(a - ai)i Gij (a - ai)j + c = (a - flmin)* Fij {a - amin)j + (a - apr)t Hij {a - apr)-, 



(3.7) 



(3.8) 
(3.9) 
(3.10) 
(3.11) 



Obviously, when the best-fit and the priors are centered in the same point, ie ai^min = fli.pr , then we have that 
0-1 i = ai mill = Oi pr and c = 0. By using Eqs. (3.11 ) we can proceed with the calculation of (3.7) as usual: 



Bi = ^e-^^J^ 



+ 00 



(27r)Af/2 



-(a-ai)i Gij {a-ai)j /2-c/2 ^ gg 



(27r)M/2 



(27r)^ 



p-xLJ2-c/2\H^A^ 

e-xLj2-c/2^I^^^H-ip\-i 



12 



(3.12) 
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where in the second hne we used Eqs. ( 2.27 1-( 2.29), in the last Hne we used the well known matrix identity |X + A| = 
l-^^ll-^M where Im is the M x M unit matrix, and finally we used the fact that Gij = Hij + Fij. By using the 

matrix identity \A\ = ^ (tr{A)'^ — <r(A^)), where tr{A) is the trace of the matrix A, we can express the determinant 
of a sum of the unit matrix Im and a matrix B as 



Im + B\ = -(tr(lM 

M(M- 1) 



B)'-tr((lM 



{M ■ 



B) • (Im 
l)tr(B) + |B| 



B))) 



(3.13) 



and then, the expression \Im + H ^F\ can be expanded to 

\Im + H-'F\ 

Then the Bayes factor can be written as 



M{M - 1) 



(Af - l)tr(H"iF) + |H"^||F| 



Bi2 



-Ax?,2mi„/2-Aci,2/2 



■^(1)^(1)1 



-1/2 



-Ax?: 



„/2-Aci,2/2 



MiMi^ + (Afi - l)tr(H7iF(i)) + |H-i||F(i)| 



M2(M2-1) 

2 



[Ah - l)tr(H72 F 



(25^'(2)) + |H(25||F(2) 



-1/2 



(3.14) 



(3.15) 



where Axi^2min = Ximin ~ X2min ^^^^ Aci_2 = Ci — C2 are the values of the constant c for the two models. Also, in the 
last line we have labeled all the different quantities with (1) and (2) to indicate the two models 1 and 2 respectively. 
Finally, it should be noted that Eq. (3.15) is an exact result. 



2. Flat priors 



Alternatively, we can choose our priors to be top-hat and centered around the best-fit, ie we integrate in the range 
], with our flat top-hat priors being equal to Aa = const. G i?*^, where M is as usual the total 



AS 



AS 



number of parameters of the model. 



In this case then, by using Eq. (3.5), the evidence Bi for the model can be written as 

^M-l , \ p3„i„+Aa/2 



n 



1 



e-^^^^/^ ■ da 



an,i„-Aa/2 



n 



Aa, 



a„i„+Aa/2 



an,in-Aa/2 



„-(a-a„i„)i Fij (a-a,„i„)j/2 , 



(3.16) 



Using Eqs. (2.9) and ( 2.27 )-( 2.29) we can define a transverse of the Fisher matrix, F-^ = U ^FU, where U is the 
unitary off-diagonal A/-dimensional matrix, e.g. for M = 2, 

U = 



1 

1 



as well as the transverse of the Cholesky decomposition of F-^, denoted with the matrix D-^. With this, we can write 
((Slel) as 



V 2^2 



(3.17) 
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where we have used Eq. (2.31) and we have defined Vi^prior = Iljlo ^ ^^^P "volume" of our priors for this 

model 1. The function erf(x) is the usual error function defined as erf(x) = ^ Jo dt, see Ref. |23] for more 
details. Then the Bayes factor for two models based on Eq. (2.1), labeled 1 and 2 with a total number of parameters 
Ml and M2, is 

n.=o erf 



/V.T/o -l^'^O I 2\/2 
^ Jy2V2,pr^or V V 

lh=o eri ^ 

Also, in the last line we have labeled all the different quantities with a (...)''^-' to indicate model 1. At this point we 
will now consider two different cases: 

• When the priors are much smaller than the errors of the best-fit parameters, ie the arguments of the error 
functions are small, or ZJ^Aai ^ 1. 

• When the priors are much larger than the errors of the best-fit parameters, ie the arguments of the error functions 
are large, or £),^Aa.i :s> 1. 

The following expansions of the error function are useful: 

2V2x / 

erf(x) « 1 + ... for a; < 1 

v27r \ 3 / 



erf(x) « 1 - + ... for a; > 1 (3.19) 



In the first case (when a; ^ 1), the evidence Bi of Eq. ( 3.17[ ) becomes 



4! 



= e->^™/^ ( ^ " ^ ''4, + - j (3-20) 
where we have used the fact that Ujio^ Djj = \D-^\ = \F\^/^. Then the Bayes factor can be written as 



M2-I (, _ (Djf'Aaf^r 



=0 4! 



1 - E ' " ' + E ' " + - 1 (3-21) 



where Ax? 2min = Ximin ~ ximin' Also, in the last line we have labeled all the different quantities with (...)^^^ and 
{.■■)^'^^ to indicate the models 1 and 2 respectively. In this limit the second term of the expression is expected to be 
close to 1. 

In the second case (when a; ^ 1) the evidence becomes 
where in this limit the last term of the expression is expected to be close to 1. Then the Bayes factor is just B12 = 



10 



C. The information criteria 



The Akaike information criterion (AIC) is defined as 

AIC = -2lnC + 2M, (3.23) 

where C is the maximum hkehhood and M the number of parameters of the model [llj . |13) . Between two models 
Ml and M2, the better of the two, is the one which minimizes the AIC. Typically, models with too few parameters 
give a poor fit to the data and hence have a low log- likelihood or x^, while models with too many are penalized by 
the last term. 

On the other hand, the Bayesian information criterion (BIC) is defined as 

BIG = -2lnC + 2M In N, (3.24) 

where TV is the number of datapoints used in the fit. In either case however, the absolute value of the criterion 
is irrelevant and only the difference between the two models is important. Usually, differences A(AIC/BIC) < 2 
represent weak evidence, differences between 2 and 6 represent positive evidence, 6—10 strong evidence, and > 10 
very strong evidence [21] . At this point the astute reader might have noticed in Table [ll] that the threshold values 
for the two information criteria are twice a big as in the case of the logarithmic Bayes factor. The reason for this of 



course is the factors of 2 that appear in Eqs. (3.23) and (3.24 1. 



D. Analysis 



As it can be seen from Eqs. (3.18) and (3.21 ), in both cases (gaussian and flat priors) the Bayes factor for this class 
of models can be written as 



B12 



-AX?,2mi„/2 . 



G{Mi,M2) 



(3.25) 



where the function G{Mi, M2) contains all the extra information of the models via their covariance matrices. Then 
the logarithmic Bayes factor can be written as 



ln(i?i2) = -Ax?^2min/2 + ln(G(A^i , M2)) 



For example, in this order of the approximation and in the case of the flat priors Eq. (3.21 ) gives 



ln(Bi2) = -AxUin/2- E + E ^ 



(3.26) 



(3.27) 



As it can be seen, Eq. (3.27) does not only contain the difference between the Xmin ^^'^ models but also contains 

information on their covariances via the the last two terms. Clearly, these two terms depend strongly on the data and 
the model at hand, thus introducing a further complexity in the model comparison. 

To back up our claims we also present an explicit example. First, we created a set of 31 mock data points {xi, y-i, (Jy.) 
with noise in the range x G [0.025, 1.55] based on the same model as in [25] 



f{x) — a + {x — b) exp(— c a;^). 



(3.28) 



where the parameters (a, b, c) have the values (0.25, 0.25, 0.5) respectively. The choice of the model was completely ad 
hoc, except for the requirement to be well behaved (smooth) and that it exhibits some interesting features, such as a 
maximum at some point x. Then, we fit these data with two polynomials that have a different number of parameters 
Ml and M2, e.g. A/2 > Mi or Mi > M^- In other words, our model comparison is done between the models Mi and 
M2 and not between f{x) of Eq. (3.281 and a polynomial. 



In Fig. [T] we show the best-fit x as a function of the number of parameters M (top left), the best-fit per degree 
of freedom N — M as a function of the number of parameters M (top right), the difference in the best-fit between 
two models wtih parameters Mi = M + 1 and AI2 = M (bottom left) and the the difference in the best-fit x^ per 
degree of freedom between two models with parameters Mi = M + 1 and M2 = M (bottom right). While the absolute 
value of the x^ can be decreased almost arbitrarily by increasing the parameters M, the improvement at some point 
ceases to become relevant compared to a model with with M — 1 parameters, ie the fit does not become better with 
respect to a model with just one less parameter. Also, the best-fit x^ per degree of freedom N — M seems to have 
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FIG. 1: The best-fit a function of tlie number of parameters M (top left), the best-fit per degree of freedom N — M 

as a function of the number of parameters M (top right), the difference in the best-fit x^ between two models wtih parameters 



Ml = M + 1 and M2 
with parameters Mi = 



M (bottom left) and the the difference in the best-fit 
M + 1 and A/2 = M (bottom right). 



X per degree of freedom between two models 



a minimum, in this case for M = 13, which means that beyond that point adding more parameters just does not 
increase the quaUty of the fit. 

In Fig. [2|we s how contour plots of the Bayes factor logi3i2 of Eq. (3.261, calculated in the case of flat priors by 
using Eq. (3.18) when the priors are taken for consistency to be proportional to the errors of the best parameters 
Aflj = n CTj for n = 3 (left) and n = 7 (right). The red color corresponds to a high value for the Bayes factor log-Bi2, 
ie model Mi preferred, blue to a low (negative) value for the Bayes factor logi?i2, ie model M2 preferred, while white 
corresponds to equal evidence for both models. The relevant values of the Jeffreys' scale are given in Table [ll] 

As it can be seen, cases that one would normally expect for M2 to be ruled out, eg Mi = 4 and M2 — 14, as 
the difference in parameters is a staggering M2 — Mi = 10 thus giving M2 a big disadvantage, has a Bayes factor 
log B12 — 1.2 for n = 3 and is actually allowed by the Jeffrey's scale! Another similar example can be seen for Mi = 4, 
M2 = 10 and n — 7, see Fig. [2] on the right, where the Bayes factor is logi3i2 — meaning that these two models are 
totally equivalent! This simple example clearly demonstrates that the Jeffrey's scale is an inadequate tool for model 
comparison, since it completely fails even in this simple example. Also, the Bayes factor depends heavily on the size 
of the priors used, since as it can be seen in the two plots of Fig. [2] the results and the conclusions for the two models 
Ml and M2 are very sensitive in the choice of the priors Aa^ , something which is not yet again taken into account by 
the Jeffreys' scale. 

In Fig. [3] we show contour plots of the Bayes factor logi?i2 of Eq. (3.26), calculated in the case of gaussian priors 
by using Eq. ( |3.15 ) when the priors are taken for simplicity to be proportional to the errors of the best parameters 
Hii = n af and Hij = for i 7^ j, when n = 3 (left) and n — 7 (right). The red color corresponds to a high value for 
the Bayes factor logi3i2, ie model Mi preferred, blue to a low (negative) value for the Bayes factor log -812, ie model 
M2 preferred, while white corresponds to equal evidence for both models. The relevant values of the Jeffreys' scale 
are given in Table [llj We find similar results as in the case of the flat priors, ie models that should be excluded by 
the Jeffrey's scale are in fact allowed. 

Finally, we have explicitly checked our methodology with other models as well and we get similar results. Specifically, 
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FIG. 2: Contour plots of the Bayes factor log_Bi2 of Eq. (3.261, calculated in the case of flat priors by using Eq. KT^ when 
the priors are taken for simplicity to be proportional to the errors of the best parameters Aoi = n ai for n — 3 (left) and n = 7 
(right). The red color corresponds to a high value for the Bayes factor log-Bi2, ie model Mi preferred, blue to a low (negative) 
value for the Bayes factor log-Bi2, ie model M2 preferred, while white corresponds to equal evidence for both models. The 
relevant values of the Jeffreys' scale are given in Table [ill 



we also considered the case where the "real" model f{x) is a parabola with three parameters (a, 6, c): 

f{x) = a + b X + c x^, (3.29) 

and we again found the surprising result that cases that one would normally expect for M2 to be ruled out, eg Mi = 4 
and M2 = 14, as the difference in parameters is a staggering M2 — Mi = 10 thus giving M2 a big disadvantage, has a 
Bayes factor log B12 ~ 1 for n = 3 that is actually allowed by the Jeffrey's scale! So, even when the real model is a 
function with three parameters, the Jeffrey's scale fails to rule out a model with 14 parameters! 
To summarize, we firmly believe that the shortcomings of the Jeffreys' scale are the following: 

• It is completely insensitive to the size of the priors. 

• It fails to work as a "one size fits all" reliable tool for model comparison. 

• It provides misleading results even in apparently simple cases, such as the examples considered here. 



IV. FIGURE OF MERIT 



In this section we will calculate the Figure of Merit (FoM) for this general model by using the usual definition, but 
we will also introduce a new version which is instead useful for reconstructed quantities that are a function of x. 
The na contours are defined by the constraint equation 

C:xiaf^xLn + Sx^ (4-1) 

where the value of Sx'^ depends on the number of parameters M and the number n of desired as |16j . This important 
parameter Sx^ can be found by solving [16] 

1 - Q (M/2, 6x^2) = erf (n/V2) (4.2) 

for Sx^ > 0, where Q{a, z) is the regularized incomplete gamma function Q{a,z) = [23j. For various combina- 

tions of M and n we show it in Table I. Equation (4.2) can be solved for Sx'^{M, n) as: 

5x\M,n)=2g(^,l-eri(^\, (4.3) 



2 ' 
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FIG. 3: Contour plots of the Bayes factor log_Bi2 of Eq. (3.261, calculated in the case of gaussian priors by using Eq. (3.151 



when the priors are taken for simplicity to be proportional to the errors of the best parameters Ha = n and Hij 



for 



i ^ j, when n = 3 (left) and n — 7 (right). The red color corresponds to a high value for the Bayes factor log-Bi2, ie model Mi 
preferred, blue to a low (negative) value for the Bayes factor log B12 , ie model M2 preferred, while white corresponds to equal 
evidence for both models. The relevant values of the Jeffreys' scale are given in Table [H) 



where G is the inverse F regularized function'^. 

By using Eqs. (2.9 1 and ( 2.27 )-( 2.29) we can rewrite the constraint equation as follows: 



C : (a - Omin)* Fij {a - amin)j = Sx 

i=0 



(4.4) 



which in the rotated frame describes a hyper-sphere in M dimensions. 

The FoM is equal to the inverse of the volume inside the space whose boundary is given by Eq. (4.1 1 or in other 
words the constrained integral 



Vol(M) = / d^'a, 
Jc 

FoM = Vol(M)" 



(4.5) 



This is clearly done so that a smaller volume (better constraints) gives a higher FoM. From Eqs. (4.4) and ( 2.27 )-( 2.29) 
the volume can be expressed as 



Vol(M) 



= \F\-^/^Vm{5x^) 



= \F\-^'^ 



F(M/2 + 1) 



2^M/2 



(4.6) 



where in the last line we used the fact that the volume of a hyper-sphere of "radius" Rm = {6x'^) in M dimensions, 

,2\ _ tt""^^ /'X,.2\*^/2 



which is equal to the constrained integral in the new basis, is Va/ ((5x 



r(Af/2+i) 



,1/2 

Finally, 



FoMiM) = \F\^/^W^^Sxr"^' 



(4.7) 



^ This function can be calculated in Mathematica as Q(x,y) = InverseGammaRegularized[x,y] and works both symbolically and 
numerically to arbitrary precision. 
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TABLE III: The determinant of the transformation matrix Afe„ for various combinations of polynomials including the monomials 
a;", the Legendre polynomials P„{x) and the Chebyshev polynomials T„{x). Clearly, the Chebyshev polynomials provide the 
best constraints out of all three cases. 



PoMi{M) _ 1 . 1 
FoM2(M) l^^l 


Monomials Legendre Chebyshev 


Monomials 
Legendre 
Chebyshev 


1.000 0.296 0.222 
3.384 1.000 0.750 
4.511 1.250 1.000 



Now, suppose we want to compare two models that have Mi 
of the FoM for the two models can be written as 



M + SM and M2 = M parameters. Then, the ratio 



FoM(Mi) _ 
FoM(M2) 



1/2 



r{{M+5M)/2+l) 

^(M + SM)/2 



-(M+5M)/2 ^ 



|i^(2)|l/2 I 

Z{M, 5M) 



r(M/2+i) 



-M/2 



1)11/2 



|i^(2)|l/2 



where we have defined 



Z{M, SM) 



r(M/2+i) 

■rrM/2 



{SxHM,n)) 



-M/2 



(4.8) 



(4.9) 



The dependence of the function Z{M, 5M) on M and 5M for n = 1 (Icr) is shown in Fig. |4] As it can easily be 
seenfrom Eq. (4.9), Z{M^ SM) is completely independent of the data, depending solely on the number of parameters 



and the number n of as. If we want to study the dependence of the FoM of the number of parameters in the case of 
a nested model, eg when Mi = M + 1 and M2 = M, then it is convenient to define the function 



Ratio(A//) 



FoM(M + 1) 
FoM(M) 



(4.10) 



In Fig. [4] we show the dependence of Ratio(M ) on M, but normalized to Ratio(M = 2). Clearly, adding more 
parameters does not improve the FoM, when the two models differ just by one parameter, ie Mi — M2 = 1. 



We can also explore the dependence of the FoM on the different bases Xn{x), as shown in Section II B but now 
with the same number of parameters M. If we denote the FoM for basis 1 as FoMi(M) and the FoM for basis 2 as 
FoM2(M), then by using Eqs. (llTland (|2.19[) we have 



FoMi(Af) _ 
FoM2(M) ~ IF2I1/2 



|A| 



(4.11) 



III 



where we have assumed that Fi = F2 A. In the case of our example we have that Xmax = 1-55, so in Table 
show the ratio of the FoM between the different combinations of bases. Clearly, the Chebyshev polynomials provide 
the best constraints out of all three cases. 



V. CONCLUSIONS 



We are entering an era where progress in cosmology is driven by data, and alternative models will have to be 
compared and ruled out according to some consistent criterium. The most conservative and widely used approach is 
Bayesian model comparison. Naively, one expects the Bayes factor to act as a discriminant among competing models 
by penalizing those with a larger set of parameters. This has been the common use of Bayesian model comparison 
in cosmology in the last decade. However, by explicitly computing the Bayes factors for models that are linear with 
respect to their parameters, we have shown that more information is needed in order to discriminate among models. 
In particular, we have seen that the thresholds associated to the so called Jeffreys' scale are not as conclusive as 
most people think they are. We have determined how accurate its predictions are in a simple case where we fully 
understand and can calculate everything analytically. 
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FIG. 4: The dependence of the function Z{M,SM) given by Eq. (4.9 1 on M and 5M for n = 1 (Icr 
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FIG. 5: The dependence of Ratio(M) on M, but normahzed to Ratio(M = 2). 
improve the FoM, when the two models differ just by one parameter, ie Mi — M2 



Clearly, adding more parameters does not 
= 1. 



We presented our results on a test of the Jeffreys' scale, which is used for comparing models, by explicitly calculating 
the best fit parameters, the minimum and the Bayes factors for all models that are linear with respect to their 
parameters regardless of the basis used, like in Eq. (2.1 ). The calculation of the Bayes factor was done for both flat 



and gaussian priors and analytic formulas were derived in both cases. We also considered the case of changing basis 
by a transformation from the original polynomial basis X{x) with a set of M linearly dependent parameters Oi to a 
new set with the same number of linearly dependent parameters M and a polynomial basis X{x). Furthermore, we 
also discussed the case of nested models, eg one with Mi and another with M2 D Mi parameters and we derived 
analytic expressions for the Bayes factor, while in Section |IV] we discussed the same problem for the Figure of Merit. 
O ur ma in result for the logarithmic Bayes factor between two models that linearly depend on the parameters. 



Eq. (3.26), does not only contain the difference between the Xmin of the two models but also contains information 
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on their covariances. Unsurprisingly, the covariances depend strongly on the data and the model at hand, thus 
introducing a further complication in model comparison. Therefore, the Bayes factor cannot be the only discriminant 
among models and thus cannot be taken as a quantitative version of Occam's razor, which simply penalizes models 
with a larger number of parameters. It is model predictiveness, not model simplicity, which is "rewarded" in Bayesian 
model comparison [27] > ESI- We have shown this by studying analytically the Bayes factors for models with both a 
small and a large number of parameters to be constrained by the same mock data. In particular, we found that models 
with Ml = 4 and M2 = 14 parameters had similar Bayes factor (lni?i2 1), and thus were "undecided" by Jeffreys' 
scale. This could only be understood if the extra 10 parameters do not contribute to the improvement of the model 
as a description of the data, irrespective of the fact that none of them, Mi nor M2, are the "true" model. In fact, 
even though the one with 14 parameters gets a better of course. However, this information is not contained in 
the Jeffrey's scale, and one could thus be fooled by the Bayes factor to assign similar probabilities to both models. 

Another similar example can be seen for Afi — 4, M2 = 10 and n = 7, see Fig.|2]on the right, where the Bayes factor 
is logi?i2 = meaning that these two models are totally equivalent! This simple example clearly demonstrates that 
the Jeffrey's scale is an inadequate tool for model comparison, since it completely fails even in this simple example. 
Also, the Bayes factor depends heavily on the size of the priors used, since as it can be seen in the two plots of Fig. [2] 
the results and the conclusions for the two models Mi and M2 are very sensitive in the choice of the priors Aa^, 
something which is not yet again taken into account by the Jeffreys' scale. 

To conclude, while the Bayes factors are clearly related to the probabilities that one of the two models are more 
likely than the other, the threshold values of Table [III of the Jeffreys' scale used to reject a model in favor of another, 
are open to interpretation. To make it more clear, the problem is not with the probabilistic interpretation of the 
Bayes factor, but with the Jeffreys' scale itself. The latter, just represents the threshold after which one is forced to 
reject a model, usually when logi3i2 > 5 (i.e. strong evidence). What we found was even when one would expect that 
a model with 14 parameters would and should be ruled out with respect to one with 4 parameters, it was allowed 
according to the Jeffreys' scale, since logi3i2 ~ l! 

Obviously, having an ad hoc scale for model comparison where the thresholds are the same irrespectively of the 
models and the data, used as a "one size fits all" tool, can lead to biased conclusions. The situation can potentially 
be even worse in cases where the models are not as simple as in the one at hand, e.g. consider the case in cosmology 
where the models used are non-linear and substantially more complicated [B]. 
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